Cox regression modeling of survival after chemotherapy for colon cancer
I. Introduction
The Cox Proportional Hazards (PH) model is one of the most widely used regression models used for survival analysis, also known as time-to-event analysis. The model investigates the association between survival time and one or more predictors. The Cox PH model is based on the hazard function, which is defined as the probability that an individual will experience the event at a given time, provided that the event has not occurred yet [1]. The outcome variable is the survival time, which is measured from a defined time until the occurrence of an event, such as death, or the end of the study period [1]. For each subject, there is time to event (T), which is the duration from a defined starting point to the event occurrence, or a censoring time (C), which is the duration until the subject drops out of the study or the study ends.
The Cox Proportional Hazards (PH) model relies on two main assumptions: 1) random censoring, and 2) the proportional hazard assumption [2]. The random censoring assumption is satisfied when patients censored in the study are a random sample from the study population. There is no statistical test for this assumption, but it can be ensured through rigorous experimental design. The proportional hazard assumption requires that the hazard function (hazard ratio) for two groups (e.g., experimental arm and standard arm) remains proportional, meaning the hazard ratio is constant over time. This assumption is violated if a covariate’s impact on the outcome changes over the follow-up period, a common occurrence in biochemical research. When the proportional hazard assumption is violated, various methods can address this, including stratifying the model based on variables that violate the assumption. This stratification allows the baseline hazard to differ across strata [3]. Another method involves including interactions between covariates and time, enabling hazard ratios to vary over time [4]. While baseline hazards are allowed to differ, the effects of other covariates are presumed consistent across strata. A limitation of this approach is that the effect of the stratification variable cannot be tested, as it is not included as a covariate in the model. Other statistical methods to accommodate non-proportional hazards are detailed and discussed elsewhere [5].
One of the properties of the Cox PH model contributing to its popularity is that the baseline hazard, i.e., \(h_0\), is an unspecified field, which makes it a semi-parametric model. It is robust and can provide reliable estimates without specifying the baseline hazard function. Even though the baseline hazard is not known, it is possible to estimate the coefficients in the exponential part of the equation. Hence, the effect of variables included in the model can be estimated [6]. Another feature of the Cox PH model is its interpretability. The effect of covariates on the hazard is represented by hazard ratios, i.e., the relative likelihood of an event happening in the experimental group with respect to the standard group.
The Cox PH model is widely used across various fields, including medical research, epidemiology, business, engineering, and social sciences [6]. It is commonly used to determine survival rates among cancer patients with different subtypes of cancer [7], or between those treated with different treatments [8]. The Cox PH model has also been used effectively in non-medical research, for example, in the prediction of time to loan defaults [9], customer time to churn [10], product lifespan and failure time analysis in engineering [11], and for the identification of factors that influence the time it takes to achieve certain rates of success, such as vaccination rates, in public health [12].
The current analysis evaluated data from one of the first successful trials of adjuvant chemotherapy for colon cancer. The colon cancer patients in the study had their primary treatment of surgery and the objective was to test whether treatment with either Levamisole or Levamisole in combination with Fluorouracil (5-FU) chemotherapeutic agents improves survival. Levamisole is a low-toxicity compound previously used to treat worm infestations in animals; 5-FU is a moderately toxic chemotherapy agent. There are two records per person, one for recurrence and one for death. The purpose of this project is to compare survival between the untreated (Obs) group vs. those treated with amisole (Lev), or amisole + 5-FU.
The current analysis evaluated modified data from one of the first successful trials of adjuvant chemotherapy for colon cancer. The study involved colon cancer patients who had undergone primary surgical treatment, and the objective was to test whether treatment with either Levamisole alone or Levamisole combined with Fluorouracil (5-FU) chemotherapeutic agents improves survival. Levamisole is a low-toxicity compound previously used to treat worm infestations in animals; 5-FU is a moderately toxic chemotherapy agent. There are two records per person, one for recurrence and one for death. The purpose of this project is to compare survival between the untreated (Obs) group vs. those treated with amisole (Lev), or amisole + 5-FU.
II. Methods
The Cox proportional hazards model was used to model the relationship between survival time and different colon cancer treatments. In particular, the survival time between the untreated group (observation) and those treated with amisole (Lev) or amisole + 5-FU was compared. The Cox regression model was chosen for this study because it is well-suited for studying the association between survival time of patients and predictors, and it allows estimating the risk or hazard of death increased or decreased due to each treatment relative to no treatment. The time (in days) until the event, i.e., death, was modeled as a function of treatment and other variables, including age, sex, and various tumor characteristics. Significant predictors were included in the final model.
IIa. Data Source
The colon cancer cancer dataset is a built-in dataset in the Survival R package [13] [14]. It is closely aligned with to the original study published in 1991 [15]. The dataset includes 929 subjects with stage B or C colon cancer, who were randomized into three treatment groups: Observation, Levamisole (Lev), or Levamisole + 5-FU. While the dataset contains survival and recurrence information for these individuals, the current analysis focused on time to death, so rows representing recurrence time were filtered out. Time to death is given in days. The median follow-up period was 1,976 days (5.4 years), with a range of 23-3,329 days. The dataset includes various patient characteristics, such as demographics and tumor details, as well as the duration from surgery to trial registration, categorized as either short or long.
Table 1: Description of variables included in the dataset
| id: | id |
| study: | 1 for all patients |
| rx: | Treatment - Obs(ervation), Lev(amisole), Lev(amisole)+5-FU |
| sex: | 1=male |
| age: | in years |
| obstruct: | obstruction of colon by tumour |
| perfor: | perforation of colon |
| adhere: | adherence to nearby organs |
| nodes: | number of lymph nodes with detectable cancer |
| time: | days until event or censoring |
| status: | censoring status |
| differ: | differentiation of tumour (1=well, 2=moderate, 3=poor) |
| extent: | Extent of local spread (1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures) |
| surg: | time from surgery to registration (0=short, 1=long) |
| node4: | more than 4 positive lymph nodes |
| etype: | event type: 1=recurrence,2=death |
IIb. Exploratory data analysis
The data was evaluated for missing values, duplicate entry, and outliers. The Survminer [16] package was used to plot Kaplan-Meier curves to visualize survival probability over time for each of the categorical variables. Multicolinearity was tested using Variant Inflation Factor (VIF) calculated using MASS package [17].
IIc. Statistical analysis
The R statistical software version 4.3.2 [18] was used for all analysis. The Survival package was used to construct the Cox regression model [13] [14].
Cox regression model is based on the hazard function \(h_x(t)\) with covariates at time t given by:
\(h_x(t)=h_0(t)\exp(\beta_1x_1 +\beta_2x_2 + \dots + \beta_p x_p)\)
Where:
\(h_x(t)\) is the hazard function
\(h_0(t)\) is the baseline hazard function
\(\beta_1x_1 + \beta_2x_2 + \dots +\beta_p x_p\) represent the linear combination of covariates and their coefficient
The hazards for the observation vs. amisole (Lev), or amisole + 5-FU group with covariate values x1 and x2 are given by: \(hx_1(t)=h_0(t)\exp(\beta_1x_1)\) and \(hx_2(t)=h_0(t)\exp(\beta_2x_2)\), respectively
The hazard ratio is expressed as: HR = \(hx_2(t)\) / \(hx_1(t)\) = \(\exp[\beta(x_2-x_1)]\)
The R MASS package was used for Step-wise variable selection, using “both” forward and backward variable selection [17]. For Step-wise selection, stepAIC() function uses AIC (Akaike Information Criterion) as the measure to add or remove predictors from the model.
IId. Model evaluation
Diagnostic for proportional hazard assumption
The Schoenfeld residual plot was constructed to test Cox proportional hazards assumption. When the proportional hazards assumption was not met for any of the covariates, stratification approach was used.
K-fold cross-validation
Model performance was evaluated using 5-fold cross-validation using the boot package [19] [20].
Model selection
The Concordance (c-index), AIC, and Baysian Information Criteriod (BIC) of corresponding models were compared to select the best fit model. The model with the lowest AIC and BIC values and highest concordance was selected as the final model.
III. Analysis and Results
IIIa. Data cleaning and feature engineering
The dataset contained missing values in the number of nodes (nodes) and differentiation (differ) columns. Missing values in the differentiation and nodes columns were replaced with the mode and median, respectively. Categories representing different groups in the differentiation, local spread, and time to surgery (surg) columns were replaced with descriptive names. Evaluation of the nodes and node4 variables showed that samples with more than four lymph nodes (node4) had fewer than five counts in the nodes column, indicating inconsistency between the two columns. Therefore, the nodes column was not used for further analysis.
IIIb. Exploratory data analysis
Figure 1: Exploratory data analysis
Exploratory data visualization (Figure 1) showed that the study participants were equally randomized into three treatment groups, with approximately 33% of individuals in the observation, Levamisole (Lev), or Lev + 5-FU groups. Fifty-two percent of the participants were male. The dataset is balanced with respect to the outcome (status), with 51% censored and 49% having died before the end of the follow-up period.
Regarding tumor characteristics, obstruction of the colon by the tumor (obstruct), perforation of the colon (perfor), and tumor adherence to nearby organs (adhere) were observed in 19%, 3%, and 15% of the patients, respectively. About 27% of the participants had more than four tumor-positive lymph nodes (node 4) and a long time from surgery to registration (surg). Overall, most categories of tumor characteristics are well represented, with about 15% of the participants having these characteristics, except for perforation, differentiation, and local spread.
Age is normally distributed. Evaluation of time to death or censoring showed a bimodal distribution. Stratifying the time variable by outcome (status) revealed that the majority of the patients died within three years of the study period.
Table 2: Patient characteristics
| Observation (%) | Amisole (%) | Amisole + 5-FU (%) | ||
|---|---|---|---|---|
| N=315 | N=310 | N=304 | ||
| Demographics | ||||
| Male | 166 (52.3) | 177 (57.1) | 141 | |
| Median age (years) [IQR] | 60 [53,68] | 61 [53,69] | 61 [52,70] | |
| Cancer characteristics | ||||
| Colon obstruction | 63 (20.0) | 63 (20.3) | 54 (17.8) | |
| Colon perforation | 9 (2.9) | 10 (3.2) | 8 (2.6) | |
| Adherence to nearby organs | 47 (14.9) | 49 (15.8) | 39 (12.8) | |
| Differentiation of tumor | ||||
| Well | 27 (8.6) | 37 (11.9) | 29 (9.5) | |
| Moderate | 236 (74.9) | 229 (73.9) | 221 (72.7) | |
| Poor | 52 (16.5) | 44 (14.2) | 54 (17.8) | |
| Extent of local spread | ||||
| Contiguous | 20 (6.3) | 12 (3.9) | 11 (3.6) | |
| Muscle | 38 (12.1) | 36 (11.6) | 32 (10.5) | |
| Serosa | 249 (79.0) | 259 (83.5) | 251 (82.6) | |
| Submucosa | 8 (2.5) | 3 (1.0) | 10 (3.3) | |
| More than 4 lymph nodes with cancer | Yes | 87 (27.6) | 89 (28.7) | 79 (26.0) |
| Short time from surgery to registration (%) | Yes | 91 (28.9) | 80 (25.8) | 76 (25.0) |
As shown in Table 2, the baseline patient characteristics were very similar between the three treatment groups, suggesting that differences in survival time between the groups are not likely attributed to differences at the start of the study.
IIIc. Survival curves
Figure 2: Censoring (left) or survival (right) time of patients
The median survival time of patients who were censored (left) or died (right) was 2,331 and 775 days, respectively. These results show that patients who were not censored died early during the follow-up period.
Stratified survival curves
#| echo: false
#| message: false
#| warning: false
#| include: true
f1 <- survfit(Surv(time, status) ~ adhere, data = df)
f2 <- survfit(Surv(time, status) ~ rx, data = df)
f3 <- survfit(Surv(time, status) ~ sex, data = df)
f4 <- survfit(Surv(time, status) ~ obstruct, data = df)
f5 <- survfit(Surv(time, status) ~ perfor, data = df)
f6 <- survfit(Surv(time, status) ~ node4, data = df)
f7 <- survfit(Surv(time, status) ~ differentiation, data = df)
f8 <- survfit(Surv(time, status) ~ local_spread, data = df)
f9 <- survfit(Surv(time, status) ~ surg, data = df)
fits <- list(adhere = f1, rx = f2, sex =f3, obstruct = f4 , perfor=f5 , node4 =f6 , differentiation =f7 , local_spread =f8 , surg= f9)
# Visualize
legend.title <- list("adhere", "rx", "sex", "obstruct", "perfor", "node4", "differentiation", "local_spread", "surg")
ggsurvplot_list(fits, df, pval=TRUE) #legend.title = legend.title,$adhere
$rx
$sex
$obstruct
$perfor
$node4
$differentiation
$local_spread
$surg
attr(,"class")
[1] "list" "ggsurvplot_list"
Figure 3: Survival curves stratified by categorical variables
There is a statistically significant difference in survival times between two or more groups being compared, as shown in the survival curves stratified by adherence to nearby organs, rx, obstruction, presence of four or more tumor-positive nodes, differentiation, local spread, and surgery. Survival time is not significantly different between males and females or among those with or without tumor perforation. The vertical lines on the Kaplan-Meier curve, which show censoring data points, are enriched after about 1,800 days, suggesting that many participants were lost to follow-up after this period.
Patient characteristics that showed significant differences in survival time between two categories are important predictors of survival time. However, the Kaplan-Meier curve does not adjust for other variables; therefore, some of the predictors may not be significant when included in the multivariate Cox regression model.
IIId. Cox regression models
Model_1: Base Model
Model_1 <- coxph(Surv(time, status) ~ 1, data = df)
c_index_model_1 <- concordance(Model_1)
cat("Concordance of the base model:",c_index_model_1$concordance)Concordance of the base model: 0.5
Model_2: Univariate analysis
# Univariate analysis
Model_2 <- coxph(Surv(time, status) ~ rx, data = df)
# Create the regression table and add concordance statistic
summary_table <- tbl_regression(Model_2, exponentiate = TRUE) %>%
add_glance_source_note(
label = list(concordance = "Concordance"),
include = c("concordance")
) %>%
modify_table_styling(
columns = p.value,
rows = p.value < 0.05,
text_format = "bold"
)
# Convert to gt table, increase font size, and adjust width
gt_table <- as_gt(summary_table) %>%
gt::tab_options(
table.font.size = px(17), # Increase font size
table.align = "center", # Align table to the left
table.width = pct(50) # Make the table wider
) %>%
gt::tab_header(
title = gt::md("**Table 3. Model_2: Univariate model**")
)
# Print the gt table
gt_table| Table 3. Model_2: Univariate model | |||
| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| rx | |||
| Obs | — | — | |
| Lev | 0.97 | 0.78, 1.21 | 0.8 |
| Lev+5FU | 0.69 | 0.55, 0.87 | 0.002 |
| Concordance = 0.536 | |||
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
The coefficient of Lev is not significant, suggesting that there is no evidence that this treatment affects survival time compared to observation. however Lev+5Fu is significant (p=0.00175), indicating that the treatment Lev +5Fu has a statistically significant effect on survival time compared to the reference group. The negative sign indicates that this treatment group has a lower hazard and likely a longer survival time. The hazard ratio for Lex+5FU (0.690), indicating the risk of death is about 31% lower compared to the observation group.
Model_2: Cox proportional hazard assumption test
zph_test <- cox.zph(Model_2)
# Convert the Schoenfeld residuals test results to a data frame
zph_df <- as.data.frame(zph_test$table)
zph_df$Variable <- rownames(zph_df)
zph_df <- zph_df %>%
mutate(
chisq = round(chisq, 3),
p = round(p, 3)
)
zph_df %>%
kbl(caption = "Table 4: Schoenfeld Residuals Test Results") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center") %>%
row_spec(0, bold = TRUE, align = "center")| chisq | df | p | Variable | |
|---|---|---|---|---|
| rx | 1.481 | 2 | 0.477 | rx |
| GLOBAL | 1.481 | 2 | 0.477 | GLOBAL |
# plot the Schoenfeld residuals
plot(zph_test)Figure 4: Schoenfeld residal plot of rx variable included in Model_2
The Schoenfeld residal plot shows that the residuals are scattered randomly and the smooth trend line is horizontal near 0. This suggests that the hazard ratio for rx (treatment status) is constant over time and the proportional hazard assumption is met. The global p-value is >0.05, indicating that the the assumption is met.
Model_3: All predictors
# Full variables: All predictors
Model_3<- coxph(Surv(time, status) ~ rx+ age + sex + perfor + adhere + surg + obstruct + differentiation + node4+ local_spread, data = df)
summary_table <- tbl_regression(Model_3, exponentiate = TRUE) %>%
add_glance_source_note(
label = list(concordance = "Concordance"),
include = c("concordance")
) %>%
modify_table_styling(
columns = p.value,
rows = p.value < 0.05,
text_format = "bold"
)
gt_table <- as_gt(summary_table) %>%
tab_options(
table.font.size = px(17), # Reduce font size
table.align = "center", # Align table to the center
table.width = pct(50), # Make the table wider
data_row.padding = px(2) # Reduce row padding
) %>%
gt::tab_header(
title = gt::md("**Table 5. Model_3: All predictors**")
)
gt_table| Table 5. Model_3: All predictors | |||
| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| rx | |||
| Obs | — | — | |
| Lev | 0.98 | 0.79, 1.22 | 0.9 |
| Lev+5FU | 0.69 | 0.54, 0.87 | 0.002 |
| age | 1.01 | 1.00, 1.02 | 0.083 |
| sex | 1.04 | 0.86, 1.26 | 0.7 |
| perfor | 1.00 | 0.59, 1.70 | >0.9 |
| adhere | 1.18 | 0.92, 1.53 | 0.2 |
| surg | 1.27 | 1.03, 1.55 | 0.022 |
| obstruct | 1.33 | 1.06, 1.68 | 0.015 |
| differentiation | |||
| moderate | — | — | |
| poor | 1.43 | 1.13, 1.82 | 0.003 |
| well | 1.08 | 0.78, 1.50 | 0.6 |
| node4 | 2.55 | 2.10, 3.09 | <0.001 |
| local_spread | |||
| contiguous | — | — | |
| muscle | 0.39 | 0.23, 0.64 | <0.001 |
| serosa | 0.64 | 0.43, 0.94 | 0.023 |
| submucosa | 0.29 | 0.10, 0.83 | 0.021 |
| Concordance = 0.674 | |||
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
The summary of Model_3 shows that when all variables are included in the model, age, sex, perforation, and adherence are not significant predictors. In contrast, rx, surg, obstruct, differentiation, node4, and local spread are significant predictors. The concordance of the multivariable model (0.674) is higher than that of the univariate model (concordance = 0.53), suggesting that the multivariate model is a better fit. However, the model concordance did not improve when removing predictors that were not significant in Model_3.
Stepwise variable selection:
A step-wise variable selection approach was used to confirm that the significant predictors in Model_3 are the best set of predictors. The stepAIC() function from the MASS package was used for stepwise variable selection, utilizing the Akaike Information Criterion (AIC) to add or remove predictors from the model.
# Model_2 includes all variables
Model_2 <- coxph(Surv(time, status) ~ rx+ age + sex + perfor + adhere + surg + obstruct + differentiation + node4+
local_spread, data = df)
# stepwise selection
stepwise_model <- stepAIC(Model_2, direction = "both")Start: AIC=5741.4
Surv(time, status) ~ rx + age + sex + perfor + adhere + surg +
obstruct + differentiation + node4 + local_spread
Df AIC
- perfor 1 5739.4
- sex 1 5739.6
- adhere 1 5741.0
<none> 5741.4
- age 1 5742.5
- surg 1 5744.5
- obstruct 1 5745.1
- differentiation 2 5745.4
- rx 2 5749.9
- local_spread 3 5753.1
- node4 1 5822.0
Step: AIC=5739.4
Surv(time, status) ~ rx + age + sex + adhere + surg + obstruct +
differentiation + node4 + local_spread
Df AIC
- sex 1 5737.6
- adhere 1 5739.1
<none> 5739.4
- age 1 5740.5
+ perfor 1 5741.4
- surg 1 5742.5
- obstruct 1 5743.2
- differentiation 2 5743.5
- rx 2 5747.9
- local_spread 3 5751.1
- node4 1 5820.1
Step: AIC=5737.59
Surv(time, status) ~ rx + age + adhere + surg + obstruct + differentiation +
node4 + local_spread
Df AIC
- adhere 1 5737.3
<none> 5737.6
- age 1 5738.6
+ sex 1 5739.4
+ perfor 1 5739.6
- surg 1 5740.7
- obstruct 1 5741.2
- differentiation 2 5741.7
- rx 2 5746.2
- local_spread 3 5749.3
- node4 1 5818.2
Step: AIC=5737.26
Surv(time, status) ~ rx + age + surg + obstruct + differentiation +
node4 + local_spread
Df AIC
<none> 5737.3
+ adhere 1 5737.6
- age 1 5738.6
+ sex 1 5739.1
+ perfor 1 5739.2
- surg 1 5740.7
- obstruct 1 5740.9
- differentiation 2 5742.1
- rx 2 5746.0
- local_spread 3 5750.4
- node4 1 5817.4
# Extract the coefficients of the selected model
selected_variables <- coef(stepwise_model)
# Print the selected variables
print(selected_variables) rxLev rxLev+5FU age
-0.010789186 -0.375746378 0.007365529
surg obstruct differentiationpoor
0.243870576 0.283164609 0.373782987
differentiationwell node4 local_spreadmuscle
0.069021620 0.929854405 -0.995556083
local_spreadserosa local_spreadsubmucosa
-0.501168855 -1.322018421
Based on a step-wise selection, rx, age, surg, obstruct, differentiation, node4 and local spread were selected for inclusion in the final model. Age is a significant predictor in the step-wise selected variables, but not in the full variable model (Model_2).
Model_4: Stepwise-Selected variables
Model_4 <- coxph(Surv(time, status) ~ rx + age + surg + obstruct +
differentiation + node4 + local_spread, data = df)
summary_table <- tbl_regression(Model_4, exponentiate = TRUE) %>%
add_glance_source_note(
label = list(concordance = "Concordance"),
include = c("concordance")
) %>%
modify_table_styling(
columns = p.value,
rows = p.value < 0.05,
text_format = "bold"
)
gt_table <- as_gt(summary_table) %>%
tab_options(
table.font.size = px(17), # Reduce font size
table.align = "center", # Align table to the center
table.width = pct(50), # Make the table wider
data_row.padding = px(2) # Reduce row padding
) %>%
gt::tab_header(
title = gt::md("**Table 6. Model_4: Stepwise-selected variables**")
)
gt_table| Table 6. Model_4: Stepwise-selected variables | |||
| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| rx | |||
| Obs | — | — | |
| Lev | 0.99 | 0.80, 1.23 | >0.9 |
| Lev+5FU | 0.69 | 0.54, 0.87 | 0.002 |
| age | 1.01 | 1.00, 1.02 | 0.069 |
| surg | 1.28 | 1.04, 1.56 | 0.018 |
| obstruct | 1.33 | 1.06, 1.67 | 0.015 |
| differentiation | |||
| moderate | — | — | |
| poor | 1.45 | 1.15, 1.84 | 0.002 |
| well | 1.07 | 0.77, 1.48 | 0.7 |
| node4 | 2.53 | 2.09, 3.07 | <0.001 |
| local_spread | |||
| contiguous | — | — | |
| muscle | 0.37 | 0.23, 0.61 | <0.001 |
| serosa | 0.61 | 0.41, 0.89 | 0.010 |
| submucosa | 0.27 | 0.09, 0.76 | 0.014 |
| Concordance = 0.672 | |||
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
The concordance of Model_4 did not improve compared to Model_3, which included all predictors. However, Model_4 is less complex as it includes a smaller number of variables (see model comparison table below).
Model_4: Cox proportional hazard assumption test
# final model with step wise variable selection
zph_test <- cox.zph(Model_4)
# Convert the Schoenfeld residuals test results to a data frame
zph_df <- as.data.frame(zph_test$table)
zph_df$Variable <- rownames(zph_df)
zph_df <- zph_df %>%
mutate(
chisq = round(chisq, 3),
p = round(p, 3)
)
zph_df %>%
kbl(caption = "Table 7. Schoenfeld Residuals Test for Model_4") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center") %>%
row_spec(0, bold = TRUE, align = "center")| chisq | df | p | Variable | |
|---|---|---|---|---|
| rx | 2.335 | 2 | 0.311 | rx |
| age | 0.549 | 1 | 0.459 | age |
| surg | 0.020 | 1 | 0.888 | surg |
| obstruct | 6.148 | 1 | 0.013 | obstruct |
| differentiation | 17.459 | 2 | 0.000 | differentiation |
| node4 | 5.662 | 1 | 0.017 | node4 |
| local_spread | 7.083 | 3 | 0.069 | local_spread |
| GLOBAL | 37.525 | 11 | 0.000 | GLOBAL |
Model_4 does not meet the Cox proportional hazards assumption, as indicated by p-values less than 0.05 in the Schoenfeld residuals test. The differentiation, node4, and obstruct variables did not meet the proportional hazards assumption, suggesting that the effects of these variables are not constant over time. To address this, a stratification approach will be used, where the model will be stratified by the variables that did not meet the proportional hazards assumption.
Model_5: Stratified Model
Model_5 <- coxph(Surv(time, status) ~ rx + age + surg + strata(obstruct) + strata(differentiation) + node4 +
local_spread, data = df)
summary_table <- tbl_regression(Model_5, exponentiate = TRUE) %>%
add_glance_source_note(
label = list(concordance = "Concordance"),
include = c("concordance")
) %>%
modify_table_styling(
columns = p.value,
rows = p.value < 0.05,
text_format = "bold"
)
gt_table <- as_gt(summary_table) %>%
tab_options(
table.font.size = px(17), # Reduce font size
table.align = "center", # Align table to the center
table.width = pct(50), # Make the table wider
data_row.padding = px(2) # Reduce row padding
)%>%
gt::tab_header(
title = gt::md("**Table 8. Model_5: Stratified Model**")
)
gt_table| Table 8. Model_5: Stratified Model | |||
| Characteristic | HR1 | 95% CI1 | p-value |
|---|---|---|---|
| rx | |||
| Obs | — | — | |
| Lev | 0.98 | 0.79, 1.22 | 0.9 |
| Lev+5FU | 0.71 | 0.56, 0.89 | 0.003 |
| age | 1.01 | 1.00, 1.02 | 0.034 |
| surg | 1.30 | 1.06, 1.59 | 0.012 |
| node4 | 2.50 | 2.06, 3.04 | <0.001 |
| local_spread | |||
| contiguous | — | — | |
| muscle | 0.34 | 0.21, 0.56 | <0.001 |
| serosa | 0.58 | 0.39, 0.84 | 0.004 |
| submucosa | 0.24 | 0.08, 0.67 | 0.007 |
| Concordance = 0.674 | |||
| 1 HR = Hazard Ratio, CI = Confidence Interval | |||
Model_5 Cox proportional hazard assumption test
zph_test <- cox.zph(Model_5)
# Convert the Schoenfeld residuals test results to a data frame
zph_df <- as.data.frame(zph_test$table)
zph_df$Variable <- rownames(zph_df)
zph_df <- zph_df %>%
mutate(
chisq = round(chisq, 3),
p = round(p, 3)
)
zph_df %>%
kbl(caption = "Table 9. Schoenfeld Residuals Test for Model_5") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center") %>%
row_spec(0, bold = TRUE, align = "center")| chisq | df | p | Variable | |
|---|---|---|---|---|
| rx | 2.001 | 2 | 0.368 | rx |
| age | 0.670 | 1 | 0.413 | age |
| surg | 0.014 | 1 | 0.905 | surg |
| node4 | 4.288 | 1 | 0.038 | node4 |
| local_spread | 5.298 | 3 | 0.151 | local_spread |
| GLOBAL | 12.411 | 8 | 0.134 | GLOBAL |
# plot the Schoenfeld residuals
plot(zph_test)Figure 5: Schoenfeld residal plot of variables included in Model_5
After model stratification by obstruct and differentiation, the proportional hazards assumption is met, as the global p-value is greater than 0.05. Node4 slightly violates the assumption, but the final model is not stratified by node4 because the model concordance is attenuated when stratifying by node4.
Evaluate multicollinearity of variables in Model_5
vif <- vif(Model_5)
print(vif) GVIF Df GVIF^(1/(2*Df))
rx 1.008066 2 1.002010
age 1.019142 1 1.009525
surg 1.008430 1 1.004206
strata(obstruct) 3.071784 0 Inf
strata(differentiation) 3.071784 0 Inf
node4 1.023390 1 1.011628
local_spread 1.017073 3 1.002825
None of the variables has VIF >5, therefore multicollinearity is not a problem in Model_5
Model comparison
# Fit the models and store them in a list
models <- list(Model_1, Model_2, Model_3, Model_4, Model_5)
# Add descriptions for each model
descriptions <- c(
"Base model",
"Treatment",
"Full variables",
"Stepwise-selected variables",
"Stratified"
)
# Create a data frame to store results
results <- data.frame(
Model = character(),
Description = character(),
AIC = numeric(),
BIC = numeric(),
C_Index = numeric(),
stringsAsFactors = FALSE
)
# Function to calculate and store metrics for each model
for (i in seq_along(models)) {
model <- models[[i]]
# Extract AIC and BIC
aic <- AIC(model)
bic <- BIC(model)
# Add C-index
c_index <- concordance(model)$concordance
# Append results to the data frame
results <- rbind(results, data.frame(
Model = paste("Model", i),
Description = descriptions[i],
AIC = aic,
BIC = bic,
C_Index = round(c_index, 3)
))
}
# Print the table using kable and kableExtra
results %>%
kbl(caption = "Table 10. Model Evaluation Metrics") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center") %>%
column_spec(1, bold = TRUE) %>%
column_spec(2:5, width = "10em") %>%
kable_styling(position = "center")| Model | Description | AIC | BIC | C_Index |
|---|---|---|---|---|
| Model 1 | Base model | 5860.383 | 5860.383 | 0.500 |
| Model 2 | Treatment | 5741.401 | 5798.993 | 0.674 |
| Model 3 | Full variables | 5741.401 | 5798.993 | 0.674 |
| Model 4 | Stepwise-selected variables | 5737.261 | 5782.511 | 0.672 |
| Model 5 | Stratified | 4567.829 | 4600.739 | 0.674 |
Based on the model evaluation metrics, Model_5 has the smallest AIC and BIC values and achieved the highest concordance (c-index). Although Model_2 and Model_5 have the same concordance, Model_5 is the best model as indicated by the smaller AIC and BIC values compared to Model_2. Five-fold cross-validation was performed to test the robustness of Model_5.
K-fold cross validation
set.seed(1234)
# Cox model
cox_model <- coxph(Surv(time, status) ~ rx + age + surg + strata(obstruct) + strata(differentiation) + node4 +
local_spread, data = df)
# calculate the original c-index
c_index_original <- concordance(cox_model)
cat("original c-index:", c_index_original$concordance, "\n")original c-index: 0.674316
# create a function for calculating c-index in each fold using concordance()
cox_cindex <- function(train_data, test_data) {
fit <- coxph(Surv(time, status) ~ rx + age + surg + strata(obstruct) + strata(differentiation) + node4 + local_spread, data = train_data)
# Calculate concordance on test data
c_index <- concordance(fit, newdata = test_data)$concordance
return(c_index)
}
# perform 5-fold cross-validation with stratification
K <- 5
folds <- createFolds(c(df$status, df$differentiation, df$rx), k = K, list = TRUE, returnTrain = TRUE)
cv_c_indices <- sapply(folds, function(train_indices) {
train_data <- df[train_indices, ]
test_data <- df[-train_indices, ]
cox_cindex(train_data, test_data) # use the concordance function inside cox_cindex
})
# Print cross-validated c-indices
print(cv_c_indices) Fold1 Fold2 Fold3 Fold4 Fold5
0.7156051 0.6605045 0.6562016 0.6477106 0.6943820
# cross-validation c-indices
cat("cross-validated c-Indices for each fold:", cv_c_indices, "\n")cross-validated c-Indices for each fold: 0.7156051 0.6605045 0.6562016 0.6477106 0.694382
cat("mean cross-validated c-Index:", mean(cv_c_indices), "\n")mean cross-validated c-Index: 0.6748808
# plot cross-validation c-indices
plot(cv_c_indices, type = "b", xlab = "Fold", ylab = "c-index", main = "c-index across folds")Figure 6: Concordance across the 5 folds
Model_5’s c-index (0.674) and mean cross-validation c-index (0.675) are very similar, suggesting that the final stratified model is stable and not overfitting.
IV. Conclusions
Treatment with Levamisole + 5-FU decreases the hazard of death from colon cancer by 29.5% (p=0.003).
Having more than 4 tumor positive lymph nodes significantly increases the hazard of death by 150% (p<0.0001).
Compared to patients with tumors that have spread to contiguous organs, those with tumors localized in the submucosa, muscle, and serosa have significantly reduced hazard rates by 76% (p = 0.007), 66% (p < 0.001), and 43% (p = 0.004), respectively.
Having long wait from surgery to trial registration increases the hazard rate by about 29.5% (p=0.012), suggesting that starting chemotherapy treatment within a short period of time after surgery improves survival.
For each additional year of age, the hazard rate increases by about 0.87% (p = 0.034), suggesting that age has minimal effect on survival.
The concordance of the model (0.674) indicates moderate predictive accuracy for survival time.
Other variables that were not included in the study may contribute to survival time.